With the increasing capacity of commodity DRAM, a dichotomy
has appeared between the memory requirements of PC graphics
systems and standard DRAM for PC main memory. PC graphics
subsystems require wide, relatively shallow memory for high
data bandwidth, while commodity DRAM is too narrow and too
deep. Integrating DRAM frame buffer and graphics accelerator
logic on the same chip solves this problem (Figure 1).

Reported integrated DRAM and logic devices separate DRAM 
and ASIC logic portions with a traditional memory interface [1]. 
Although this approach has a number of benefits over the 
discrete solution, much greater improvements are available by 
more tightly integrating the DRAM and logic, something not 
possible at the board level. This device integrates parts of the
graphics processor within the DRAM to increase performance.
Figure 2 shows the architecture of one bank of the frame buffer,
where the pixel processing unit (PPU) and the serial output
registers (SORs) are integrated into the DRAM architecture [2].
This allows the bus width between the DRAM frame buffer and 
the processor to be 4096b. 

Figure 3a shows a traditional DRAM, where a relatively narrow 
databus runs the length of the sense amp straps. To allow for
massively-parallel DRAM access, the frame buffer uses
databuses orthogonal to the standard direction (Figure 3b).
These databuses can be routed in metal2 (M2) over an array
with capacitor-over-bitline processing. This allows up to 1 
databus for each column of DRAM, depending on the M2 pitch of 
the process. The device employs 8 banks, each of 512b databus 
width. 
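
The width arithmetic above can be checked with a short sketch. The constant and function names are illustrative, and the 64b bus and the 1024-pixel scanline are only assumed reference points for comparison; none of this is taken from the device's design files.

```python
BANKS = 8                      # banks in the frame buffer (from the text)
DATABUS_BITS_PER_BANK = 512    # databus width of each bank (from the text)


def total_bus_width(banks: int = BANKS,
                    bits_per_bank: int = DATABUS_BITS_PER_BANK) -> int:
    """Aggregate databus width when every bank is accessed in parallel."""
    return banks * bits_per_bank


def accesses_needed(total_bits: int, bus_bits: int) -> int:
    """Number of bus-wide accesses needed to move total_bits (ceiling division)."""
    return -(-total_bits // bus_bits)


if __name__ == "__main__":
    width = total_bus_width()
    scanline_bits = 1024 * 16  # assumed example: 1024 pixels at 16b/pixel
    print("aggregate width:", width)
    print("accesses at 64b:", accesses_needed(scanline_bits, 64))
    print("accesses at full width:", accesses_needed(scanline_bits, width))
```

The sketch counts accesses only; it says nothing about cycle times, which the text does not give for the DRAM core.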

The PPU is 512b wide per bank, requiring one bit of the PPU to 
pitch-match to 4 DRAM columns. With this architecture, the 
choice of circuits included in this massively parallel processor 
must be judicious. It is limited to the most basic pixel operations: 
the raster operations. Figure 4 shows a single bit of the PPU, and 
how it is pitch-matched to the DRAM to accelerate raster 
operations. The PPU registers are built using dual port 6T SRAM 
cells as shown in Figure 5a, while the rastop function unit is
implemented with a pMOS-based 8-1 multiplexor as shown in
Figure 5b. This allows the PPU to perform any of the 256 rastops
possible with 3 input variables. All 512b are identical, and since 
there is no dataflow directly between bits, redundant PPU bits 
can be easily added (pitch-matched to the DRAM redundant 
columns). This allows more aggressive core rules for the PPU 
to reduce area. 
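
The reason an 8-1 multiplexor covers all 256 rastops can be shown with a small behavioral model: an 8-bit rastop code is the truth table of a 3-input function, and the three operand bits at each position select one entry of it. This is a minimal sketch, not the circuit of Figure 5b. The operand names, the function name, and the select ordering (pattern as the high select bit, source in the middle, destination low) are assumptions; that ordering matches the widely used ROP3 numbering in which 0xCC copies the source.

```python
WIDTH = 512  # PPU bits per bank (from the text)


def rastop(code: int, src: int, dst: int, pat: int, width: int = WIDTH) -> int:
    """Apply the 3-input raster operation selected by an 8-bit code.

    Every bit position behaves like its own 8-to-1 multiplexer: the operand
    bits (pat, src, dst) form a 3-bit select that picks one bit of `code`.
    """
    if not 0 <= code <= 0xFF:
        raise ValueError("rastop code must fit in 8 bits")
    mask = (1 << width) - 1
    result = 0
    for sel in range(8):
        if not (code >> sel) & 1:
            continue  # this truth-table entry outputs 0
        # Build a mask of the bit positions whose operands equal `sel`.
        p = (pat if sel & 4 else ~pat) & mask
        s = (src if sel & 2 else ~src) & mask
        d = (dst if sel & 1 else ~dst) & mask
        result |= p & s & d
    return result


if __name__ == "__main__":
    s, d, p = 0b1100, 0b1010, 0b1111
    print(bin(rastop(0x66, s, d, p, width=4)))  # source XOR destination
```

Because no bit depends on its neighbors, the same function applies unchanged to a 512b bank word or to a spare bit slice, which is the property the text relies on for adding redundant PPU bits.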

This frame buffer architecture allows the graphics controller to 
accelerate some of the most common graphic operations. For 
instance, a block move can require a source pixel read, a 
destination pixel read, a rastop, and a destination pixel write. In 
a typical system with a 64b memory interface, this is done 64b at
a time. In this chip, all reads can be done at up to 4096b wide, 
as can the actual raster operation, and the write back. For 
operations that require realignment of source pixels to destination
pixels, a 32b funnel shifter is built under the FB databus in each
bank, allowing realignment of 256b at a time. Also, a 32b word can
be written to all 16 PPU words simultaneously. If all 8 banks are
enabled, this allows for a 4096b write to DRAM in a single cycle,
useful for screen clearing.
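
The two helpers below sketch the realignment and fill operations just described. The text does not detail the shifter internals, so this uses the usual definition of a funnel shifter (a 32b window taken from two concatenated 32b words); the function names, the bit ordering, and the shift direction are assumptions.

```python
WORD_BITS = 32                  # funnel shifter and PPU word width (from the text)
WORD_MASK = (1 << WORD_BITS) - 1
WORDS_PER_BANK = 16             # 16 x 32b = 512b per bank (from the text)
BANKS = 8                       # from the text


def funnel_shift(hi: int, lo: int, shift: int) -> int:
    """Return the 32b window that starts `shift` bits above the LSB of hi:lo."""
    if not 0 <= shift < WORD_BITS:
        raise ValueError("shift must be in the range 0..31")
    combined = ((hi & WORD_MASK) << WORD_BITS) | (lo & WORD_MASK)
    return (combined >> shift) & WORD_MASK


def broadcast_word(word: int, words: int = WORDS_PER_BANK) -> int:
    """Replicate one 32b word into every word slot of a bank-wide value."""
    word &= WORD_MASK
    value = 0
    for i in range(words):
        value |= word << (i * WORD_BITS)
    return value


if __name__ == "__main__":
    # Realign by one byte across a word boundary.
    print(hex(funnel_shift(0x12345678, 0x9ABCDEF0, 8)))
    # One fill word per bank, all banks enabled: bits written per cycle.
    print(BANKS * broadcast_word(WORD_MASK).bit_length())
```

One shifter per bank handles 32b, so eight banks together realign 8 x 32b = 256b, and a broadcast to 16 words in each of 8 banks covers 4096b, matching the figures in the text.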

The gate array logic incorporates the graphic accelerator, PCI
interface, VGA core and video input block to provide the necessary
system level functionality for a PC graphics device. Within the
accelerator logic the highly-parallel nature of the DRAM and its
embedded logic gives a much larger control space than a normal
external DRAM interface. This requires a novel approach to
BITBlt operation since data alignment, manipulation (ROP) and
masking must be performed in the optimal way to achieve
the best performance. The availability of five 512b registers within
the PPUs allows data to be cached for some operations to increase
performance further. The custom-designed pixel output path
(POP) logic includes a 64b hardware cursor, a video scaler with a
4-tap FIR filter for horizontal scaling and a 2-tap filter for vertical
scaling, color space conversion circuits, and a 135MHz LUTDAC.
This device uses dual fully integrated PLLs, one for the 66MHz
main clock (MCLK), and one for the 135MHz pixel clock (PCLK).

Noise is a concern on a mixed DRAM/logic chip. Noise from the
ASIC logic can cause a degradation of the refresh time for DRAM
cells. Refresh time is programmable, to allow the use of the
maximum refresh time possible after processing. The device
employs separate power pins and on-chip power rails for DRAM, 
gate array, PLLs, DACs, and POP logic. Critical circuitry is 
guarded with substrate and well guard rings. 

Testing methodologies have yet to be standardized for mixed 
DRAM/logic chips. Although BIST is useful for embedded SRAMs,
redundancy and complicated cell coupling tests make it difficult
to use for large embedded DRAMs. BIST also complicates the
debug of an embedded DRAM, since the controllability and
observability of that DRAM are severely limited. In this device all
DRAM data and control is multiplexed onto the device pins in test 
mode, using an on-chip test interface. This circuit makes the 
entire frame buffer appear similar to a standard asynchronous 
DRAM in memory test, allowing use of standard DRAM test
software. All bits of the SORs and PPUs are mapped into unused 
row addresses for test. Even the rastop processor is mapped to 
unused row addresses to allow the entire frame buffer to be 
easily tested on a standard DRAM tester. The digital logic blocks 
are tested using full scan methodology and functional test 
vectors, and the test interface allows access for these paths. For 
device burn-in only, the logic incorporates a BIST block for the
DRAM array. This ensures that the DRAM is cycled during
post-package burn-in testing with only the requirement for power,
ground, and a few control pins connected to a tester.

The device is implemented in a 0.35µm/0.5µm blended DRAM and
logic process using triple-level metal. Table 1 shows the
features of the chip. Figure 6 is a chip micrograph.

Figure 1: Chip architecture.
Figure 2: Frame buffer bank.
Figure 3a: Traditional DRAM.
Figure 3b: Wide databus scheme.
Figure 4: See page 461.
Figure 5a: PPU register bit.
Figure 5b: RASTOP processor.
Figure 6 and Table 1: See page 461.

Maximum screen sizes         1280x1024x8b/pixel
                             1024x768x16b/pixel
                             800x600x24b/pixel
Technology                   0.35µm/0.5µm BlendIC
DRAM                         13.4Mb
Pitch-matched logic          160k gates
SRAM                         38kb
Maximum system clock rate    66MHz
Maximum pixel clock rate     135MHz
Supply voltage               3.3V
Package                      208-pin QFP

Table 1: Chip overview.
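
The maximum screen sizes in Table 1 can be cross-checked against the DRAM capacity with a few lines of arithmetic. This is an illustrative sketch: it reads "13.4Mb" as 13.4 x 10^6 bits and the first mode as 8b/pixel, both of which are assumptions about the table, and the function names are not from the source.

```python
# "13.4Mb" from Table 1, read here as 13.4 x 10^6 bits (assumption).
DRAM_BITS = 13_400_000

# (width, height, bits per pixel) for the listed maximum screen sizes.
MODES = [
    (1280, 1024, 8),
    (1024, 768, 16),
    (800, 600, 24),
]


def frame_bits(width: int, height: int, bpp: int) -> int:
    """Bits needed to hold one full frame."""
    return width * height * bpp


def fits(width: int, height: int, bpp: int, capacity: int = DRAM_BITS) -> bool:
    """True when one full frame fits in the given capacity."""
    return frame_bits(width, height, bpp) <= capacity


if __name__ == "__main__":
    for w, h, bpp in MODES:
        print(f"{w}x{h}x{bpp}b: {frame_bits(w, h, bpp)} bits, fits={fits(w, h, bpp)}")
```

Under this reading each listed mode needs between about 10.5 and 12.6 million bits, while 1280x1024 at 16b/pixel (about 21 million bits) would not fit.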

Figure 4: Pixel processing unit.

Figure 6: Chip micrograph.